$LCSk$++: Practical similarity metric for long strings
Authors
Abstract
In this paper we present LCSk++: a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. large genomes of plants and animals, classic algorithms such as Longest Common Subsequence (LCS) fail due to their demanding computational complexity. Recently, Benson et al. defined a similarity metric named LCSk. By relaxing the requirement that the k-length substrings should not overlap, we extend their definition into a new metric. An efficient algorithm is presented which computes LCSk++ with complexity of O((|X| + |Y|) log(|X| + |Y|)) for strings X and Y under a realistic random model. The algorithm has been designed with implementation simplicity in mind. Additionally, we describe how it can be adjusted to compute LCSk as well, which gives an improvement over the O(|X| · |Y|) algorithm presented in the original LCSk paper.
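To make the two metrics concrete, here is a minimal sketch of a plain quadratic dynamic program. It is not the O((|X| + |Y|) log(|X| + |Y|)) algorithm the abstract refers to, only a reference baseline. It assumes the following readings of the definitions, which the abstract does not spell out: LCSk counts the maximum number of non-overlapping, order-preserving common substrings of length exactly k, and LCSk++ scores the maximum total number of characters covered by order-preserving common substrings of length at least k. The function names `lcsk_quadratic` and `lcskpp_quadratic` are hypothetical.

```python
def _common_suffix_runs(x: str, y: str) -> list[list[int]]:
    """run[i][j] = length of the longest common suffix of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
    return run


def lcsk_quadratic(x: str, y: str, k: int) -> int:
    """Assumed LCSk: max number of non-overlapping common k-length substrings."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(x), len(y)
    run = _common_suffix_runs(x, y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if run[i][j] >= k:
                # A k-length match ends at (i, j); it contributes one unit.
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]


def lcskpp_quadratic(x: str, y: str, k: int) -> int:
    """Assumed LCSk++: max characters covered by common substrings of length >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(x), len(y)
    run = _common_suffix_runs(x, y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    # ext[i][j] = best score when a matched block ends exactly at (i, j).
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if run[i][j] >= k:
                # Start a fresh block of length exactly k ...
                cand = dp[i - k][j - k] + k
                if run[i][j] > k:
                    # ... or extend the block ending at (i-1, j-1) by one character.
                    cand = max(cand, ext[i - 1][j - 1] + 1)
                ext[i][j] = cand
                best = max(best, cand)
            dp[i][j] = best
    return dp[n][m]
```

Under these assumptions, both functions reduce to the classic LCS length when k = 1. For two identical strings of length 5 and k = 2, the first counts two disjoint 2-length blocks, while the second covers all five characters.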
Similar resources
Fast and simple algorithms for computing both $LCS_{k}$ and $LCS_{k+}$
Longest Common Subsequence (LCS) deals with the problem of measuring similarity of two strings. While this problem has been analyzed for decades, the recent interest stems from a practical observation that considering single characters is often too simplistic. Therefore, recent works introduce the variants of LCS based on shared substrings of length exactly or at least k (LCSk and LCSk+ respect...
PASS-JOIN: A Partition-based Method for Similarity Joins
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long stri...
EmbedJoin: Efficient Edit Similarity Joins via Embeddings
We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. i...
Efficiently Supporting Edit Distance Based String Similarity Search Using B$^+$-Trees
Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most appr...
The Analytic Technique and Experimental Research Methods of Post-buckling about Slender Rod Strings in Wellbore
The buckling behavior of rod strings in wellbore is one of the key issues in petroleum engineering. The slender rod strings in vertical wellbore were selected as research objects. Based on the energy method, the critical load formulas of sinusoidal and helical buckling were derived for the string with the bottom of the wellbore pressure. According to the sinusoidal and helical buckling’s geomet...
Journal title:
- CoRR
Volume abs/1407.2407 Issue
Pages -
Publication date 2014